Data-driven models for atmospheric air temperature forecasting at a continental climate region

Atmospheric air temperature is the most crucial meteorological parameter. Besides influencing multiple fields such as hydrology, the environment, irrigation, and agriculture, this parameter describes climate change and global warming quite well. Accurate and timely air temperature forecasting is therefore essential because it provides information that can be relied on for future planning. In this study, six data-driven approaches, Support Vector Regression (SVR), Regression Tree (RT), Quantile Regression Tree (QRT), ARIMA, Random Forest (RF), and Gradient Boosting Regression (GBR), have been applied to forecast short- and mid-term air temperature (daily and weekly) over North America under continental climatic conditions. The time series is relatively long (2000 to 2021); 70% of the data (2000 to 2015) are used for model calibration, and the rest are used for validation. The autocorrelation and partial autocorrelation functions have been used to select the best input combination for the forecasting models. The quality of the predictive models is evaluated using several statistical measures and graphical comparisons. For the daily scale, the SVR has generated more accurate estimates than other models, with Root Mean Square Error (RMSE = 3.592°C), Correlation Coefficient (R = 0.964), Mean Absolute Error (MAE = 2.745°C), and Theil's U-statistic (U = 0.127). In addition, the study found that both RT and SVR performed very well in predicting weekly temperature. The study also found that the length of the employed data and its dispersion and volatility from month to month substantially influence the efficacy of the predictive models. Furthermore, a second scenario is conducted in which a randomization method divides the data into training and testing phases. The models performed much better in the second scenario than in the first, indicating that climate change affects the temperature pattern of the studied station.
The findings offered technical support for generating high-resolution daily and weekly temperature forecasts using Data-Driven Methodologies.


Introduction
term memory (LSTM) network to predict temperature for both ultra-short-term and short-term periods (hourly and one day ahead). The results showed that the LSTM model was able to efficiently predict the temperature for both time scales. However, the LSTM has several disadvantages: it requires more time and memory to train, its parameters are difficult to assign and implement, and its outcomes are sensitive to the random weight initialization. S. Salcedo-Sanz et al. [12] used the support vector regression (SVR) and multi-layer perceptron (MLP) models to predict the mean monthly air temperature. The dataset from monitoring stations located in New Zealand and Australia was used for the model development. The results showed that the SVR model provided the best accuracy in temperature prediction. Overall, very few studies on daily temperature prediction have been conducted for regions with continental climatic conditions. Therefore, the main objective of this study is to forecast air temperature for a continental-climate case study located in North America. Two time scales are adopted in this study, daily and weekly. To fulfill this task, four data-driven models, i.e., Support Vector Regression (SVR), Regression Tree (RT), Quantile Regression Tree (QRT), and Gradient Boosting Regression (GBR), have been applied. These models have been used to predict the temperature one day and one week ahead from the past temperature values at the corresponding time scale. Comprehensive comparisons supported by statistical measures and comparative figures have been carried out to select the most efficient models.

Case study
North Dakota is located in the middle of North America and is subject to extreme climate conditions, with hot summers and cold winters. Because of its inland location and its almost equal distance from the North Pole and the Equator, temperature fluctuations are pronounced. Furthermore, the temperature varies strongly from season to season, which may be responsible for the changes in weather over time [42]. Since North Dakota has a continental climate, forecasting the patterns of meteorological parameters is a challenging task. The difficulty of simulating weather parameters in such a region may be due to the fluctuating climate across the seasons. Tables 1 and 2 show the statistical characteristics of the minimum, mean, average, standard deviation, and skewness of the daily and weekly air temperature values at the Crary  which means that the data are more consistent with their normal rates. Notably, the data used in this study were collected from the open-source website of the Crary station [43]. Finally, the location of the studied region is shown in Fig 1.

Support vector regression
Support vector regression (SVR) is a powerful and efficient tool based on statistical learning theory. It was first introduced by Vapnik [44] as the regression counterpart of the support vector machine (SVM). Based on the principle of structural risk minimization (SRM), SVR has been successfully applied to real-world classification and regression problems [45]. The linear relationship between the independent variables $(x_1, x_2, x_3, \ldots, x_r)$ and the dependent variable $y$ is given in the equation below.
Where $w_i$ and $b$ are the weight and bias of the model, respectively, and $\phi(x)$ is the higher-dimensional feature space into which the independent (input) vector is mapped. These parameters can be determined by minimizing $\|w\|^2 = (w \cdot w)$ under the constraints, where $C$ is the regularization constant, $\xi_i$ and $\xi_i^*$ are the slack variables, and $\varepsilon$ is the size of the tube, "denoting the accuracy of the function to be approximated" [46].
Based on Lagrange multipliers, the standard SVR solves the following optimization problem, subject to its constraints, where $\rho$ is the cost factor and $\alpha_i, \alpha_i^* \geq 0$ are the Lagrange multipliers. The resulting linear SVR is often inappropriate for engineering problems because of its linear character, while such problems frequently need a non-linear regression analysis. Therefore, nonlinear kernel functions are used to map the input data into a much higher-dimensional space. In this regard, the radial basis function (RBF) kernel is used in this study and can be expressed mathematically in the equation below.
Where $K(x_i, x)$ represents the kernel function and $\beta$ is the bandwidth of $K(x_i, x)$.
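The equations referred to in this subsection can be sketched in the standard textbook form of ε-insensitive SVR, using the symbols defined above. This is an assumed reconstruction, not necessarily the exact notation of the study: the factor 1/2, the noise-free dual form, and the exponential form of the RBF kernel are the conventional choices, and the cost factor $\rho$ mentioned in the text is taken here to play the role of the upper bound $C$ on the multipliers.

```latex
% Regression function in the feature space
f(x) = w \cdot \phi(x) + b

% Primal problem
\min_{w,\,b,\,\xi,\,\xi^*} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \left( \xi_i + \xi_i^* \right)
\quad \text{subject to} \quad
\begin{cases}
y_i - \left( w \cdot \phi(x_i) + b \right) \le \varepsilon + \xi_i \\
\left( w \cdot \phi(x_i) + b \right) - y_i \le \varepsilon + \xi_i^* \\
\xi_i,\ \xi_i^* \ge 0
\end{cases}

% Dual problem obtained with Lagrange multipliers
\max_{\alpha,\,\alpha^*} \; -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j)
- \varepsilon \sum_{i=1}^{N} (\alpha_i + \alpha_i^*)
+ \sum_{i=1}^{N} y_i (\alpha_i - \alpha_i^*)
\quad \text{subject to} \quad
\sum_{i=1}^{N} (\alpha_i - \alpha_i^*) = 0, \qquad 0 \le \alpha_i,\ \alpha_i^* \le C

% Resulting regression function and RBF kernel
f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) K(x_i, x) + b,
\qquad
K(x_i, x) = \exp\left( -\beta \, \|x_i - x\|^2 \right)
```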

Regression tree and quantile regression tree
A decision tree (DT) is a supervised machine learning technique that uses labeled data (data with known target attributes) to carry out simulations with the help of classification and regression algorithms [47]. In general, DTs consist of three types of nodes: root nodes, internal nodes, and leaf nodes, where each node or leaf denotes a class label while the branches denote the outcome of the test performed [48]. The technique splits the input dataset on the basis of the most significant splitter or differentiator among the input variables. This process of data division and selection of the most significant attribute in the dataset is governed by the classification and regression algorithms. The technique follows a top-down approach: the top of the tree holds all the observations at one spot, which splits into two or more branches that split further. This approach is also referred to as greedy, as it considers only the current nodes without looking ahead to future nodes [49]. The decision tree algorithm continues to run until a stopping criterion, such as a minimum number of observations, is reached. Once this criterion is met and a decision tree is developed, many nodes are detected as outliers, which may be addressed through tree pruning. This, in turn, improves the forecasting accuracy of the DT-based model. In the same way that regression minimizes a cost function (i.e., squared-error loss) when forecasting a single point estimate, the quantile regression tree (QRT) minimizes the quantile loss function when forecasting a particular quantile. The median, or 50th percentile, is the most commonly used quantile, and the quantile loss in this case reduces to the sum of absolute errors (up to a constant factor). Additionally, quantiles can be used as endpoints of prediction intervals; for instance, the 10th and 90th percentiles bound the central 80% prediction interval.
It appears that the quantile loss differs according to the evaluated quantile, with higher quantiles penalizing more for negative errors and lower quantiles penalizing more for positive errors. Accordingly, in this study, we used the median (the 50 th percentile), which is the most well-known quantile.
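The asymmetry of the quantile loss can be illustrated with a minimal sketch of the standard pinball loss. The function name `pinball_loss` and the sample values are illustrative assumptions, not taken from the study; errors are interpreted here as predicted minus actual, so a negative error means under-prediction.

```python
def pinball_loss(actual, predicted, q):
    """Mean quantile (pinball) loss for a quantile level q in (0, 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    total = 0.0
    for a, p in zip(actual, predicted):
        diff = a - p  # positive when the model under-predicts
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(actual)


# Illustrative (made-up) temperatures in degrees Celsius.
actual = [10.0, 12.0, 8.0, 15.0]
predicted = [11.0, 10.0, 8.5, 13.0]

mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
loss_median = pinball_loss(actual, predicted, 0.5)  # half of the MAE
loss_p90 = pinball_loss(actual, predicted, 0.9)     # under-prediction costs more
loss_p10 = pinball_loss(actual, predicted, 0.1)     # over-prediction costs more
```

At q = 0.5 the loss is exactly half of the mean absolute error, so minimizing it is equivalent to minimizing the absolute error, which is why the median is tied to absolute-error regression.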
In the fields of artificial intelligence and search algorithms, pruning is a data compression method used to minimize the size of decision trees by deleting parts of the tree that are deemed non-critical and repetitive to the regression of instances. By assessing the predictive value of each node in a regression tree, regression tree pruning decreases the danger of overfitting. Nodes that do not increase the anticipated prediction efficiency on new data are substituted with leaves.

Gradient Boosting Regression (GBR)
GBR is an ensemble machine learning approach that enhances the prediction performance of a classical decision tree through a sequential statistical process called boosting, whose principal idea is to combine a set of weak predictive models into a single, highly accurate predictive model [50,51]. The technique applies an iterative procedure in which each new tree model (weak learner) is fitted to the pseudo-residuals (the negative gradient of the loss function) of the current learner [52]. This process is repeated until the loss function of the model is reduced to a minimum value, thus improving the forecasting performance of the model.
The iterative training process of the GBR with K decision trees can be briefly explained as follows. For a given training dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, the loss function is computed and the following steps are applied:
Step 1. Initialize the model (weak learner) with a constant value.
Step 2. For the iterations m = 1, 2, 3, ..., K:
(a) For i = 1, 2, 3, ..., N, calculate the pseudo-residual $r_{mi}$ of the i-th training record.
(b) Fit a regression tree to the $r_{mi}$ and deduce the leaf node areas $R_{ml}$ of the m-th tree. Predict the leaf node areas of the decision tree to attain an approximate value of the fitting residual.
(c) For l = 1, 2, 3, ..., L, apply a line search to obtain the value in each leaf node area that minimizes the loss function with gradient descent, which gives the best residual fitting value of each leaf.
(d) Update the regression tree.
Step 3. Obtain the final model.
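The steps above correspond to the standard gradient tree boosting procedure, which can be sketched as follows. The notation ($F_m$ for the model after m iterations, $c_{ml}$ for the leaf value, $L(\cdot,\cdot)$ for the loss) is the usual textbook one and is an assumption here, since the expressions of the study are not reproduced in this text.

```latex
% Step 1: constant initial model
F_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)

% Step 2(a): pseudo-residuals (negative gradient of the loss)
r_{mi} = - \left[ \frac{\partial L\left(y_i, F(x_i)\right)}{\partial F(x_i)} \right]_{F = F_{m-1}},
\qquad i = 1, \ldots, N

% Step 2(c): best fitting value for each leaf area R_{ml}
c_{ml} = \arg\min_{c} \sum_{x_i \in R_{ml}} L\left(y_i, F_{m-1}(x_i) + c\right),
\qquad l = 1, \ldots, L

% Step 2(d): update of the model
F_m(x) = F_{m-1}(x) + \sum_{l=1}^{L} c_{ml} \, I\left(x \in R_{ml}\right)

% Step 3: final model after K iterations
\hat{F}(x) = F_K(x)
```

For the squared-error loss $L = \frac{1}{2}(y - F)^2$, the pseudo-residual reduces to the ordinary residual $r_{mi} = y_i - F_{m-1}(x_i)$ and $c_{ml}$ is the mean residual in leaf $R_{ml}$.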

Autoregressive model
The autoregressive integrated moving average (ARIMA) model is based on historical data and is considered the most common time series modeling approach. It was first introduced by Box and Jenkins in 1976 [53]. ARIMA is a hybrid model that combines generalized forms of the autoregressive (AR) and moving average (MA) models to describe nonstationary univariate time series, approximating the series with a mathematical model based on past and current values. The ARIMA model is specified by setting the order of three terms: autoregression, differencing, and moving average. The general successive difference equation of order d can be expressed mathematically as follows.
Where d is the order of differencing and B is the backshift operator. The general ARIMA equation can be briefly presented as follows [54].
Where $\phi_p(B)$ is the autoregressive operator of order p, $\theta_q(B)$ is the moving average operator of order q, and $w_t = \Delta T_t$.
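In the usual Box-Jenkins notation, the two expressions described above can be sketched as follows. The white-noise term $\varepsilon_t$, the sign convention of the operators, and the use of $\nabla^d$ for the d-th order difference (written $\Delta$ in the text) are assumptions of this sketch.

```latex
% d-th order successive difference of the temperature series T_t
w_t = \nabla^{d} T_t = (1 - B)^{d} T_t, \qquad B\,T_t = T_{t-1}

% General ARIMA(p, d, q) model
\phi_p(B)\, w_t = \theta_q(B)\, \varepsilon_t

% with the operators
\phi_p(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p,
\qquad
\theta_q(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q
```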

Random forest
Random forest is a supervised machine learning technique made up of a large number of small decision trees, known as estimators, each of which generates its own prediction. The 'forest' generated by the random forest algorithm is trained through bagging, or bootstrap aggregating [55]. Bagging is an ensemble meta-algorithm that improves the prediction accuracy of machine learning algorithms. The random forest algorithm produces its output from the predictions of the individual decision trees: it takes the average of the outputs of the various trees. Increasing the number of trees increases the precision of the outcome. Its advantages over other machine learning approaches, such as lower computation time, ease of working with high-dimensional data, strong fault tolerance, and parallel processing, make it suitable even for very high-dimensional problems like air temperature forecasting.

Model development
In this work, four regression models i.e., SVR, GBR, QRT, and RT have been used to predict the daily and weekly air temperatures over the continental climatic region of North America. The time-series data was collected from the Crary meteorological station from 2000 to 2021.
To select the best input lags, the autocorrelation function (ACF) and partial autocorrelation function (PACF) have been used to analyze the data. According to Fig 2, the ACF provides information on time series properties such as stationarity, trend, seasonality, and randomness. The daily and weekly temperature patterns were examined with the ACF and PACF to determine the most appropriate predictors. These statistical techniques use the time-lagged data of the temperature series to estimate the correlation between the present T value and a prior T value separated by a given daily or weekly interval (i.e., a time lag) [55]. Selecting the lags that have a significant correlation and carry significant information therefore benefits the models. In contrast, the lags confined between the upper and lower bounds are neglected because they have low correlations and represent the white noise in the time series, which cannot be predicted. Both the ACF and PACF are given by the following equations.
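A minimal sketch of how the sample ACF and its significance bounds can be computed is given here. It assumes the common estimator $r_k = \sum_{t=1}^{N-k}(X_t - \bar{X})(X_{t+k} - \bar{X}) / \sum_{t=1}^{N}(X_t - \bar{X})^2$ and the large-sample 95% bounds $\pm 1.96/\sqrt{N}$; the exact expressions used in the study may differ. The function names and the synthetic series are illustrative only.

```python
import math


def acf(series, max_lag):
    """Sample autocorrelation coefficients r_0 .. r_max_lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    coefficients = []
    for k in range(max_lag + 1):
        num = sum((series[t] - mean) * (series[t + k] - mean)
                  for t in range(n - k))
        coefficients.append(num / denom)
    return coefficients


def white_noise_bounds(n, z=1.96):
    """Approximate 95% bounds (LO, UP) for a white-noise series of length n."""
    half_width = z / math.sqrt(n)
    return -half_width, half_width


# Synthetic seasonal series with a period of 12 steps (stand-in for temperature).
series = [10.0 * math.sin(2 * math.pi * t / 12) for t in range(240)]
r = acf(series, 12)
lo, up = white_noise_bounds(len(series))

# Lags whose coefficient falls outside the bounds are candidate model inputs.
significant_lags = [k for k in range(1, 13) if r[k] < lo or r[k] > up]
```

For a strongly seasonal series such as this one, the ACF stays large at the seasonal lag, which mirrors the slow decay reported for the daily temperature data.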

PLOS ONE
In these equations, N is the total number of temperature records, $X_t$ is the time series record at time t, $\bar{X}$ is the mean of the temperature records, and k is the number of lags in the time series data. The lower and upper limits (LO, UP) are determined at the 95% confidence level. The ACF declines slowly at the daily scale, which means that the time series is not stationary. Because of seasonality, it is challenging to select the most effective lags for the daily scale using the ACF. The PACF, like the ACF, measures the association between two records, but only the part that the shorter lags between those records do not explain. For instance, the partial autocorrelation coefficient for the third lag of the daily temperature series is the correlation that the shorter lags (lag one and lag two) have not already explained. Therefore, the PACF is more suitable than the ACF for selecting the input lags for short-scale time series prediction. Table 3 shows the input combinations used in this study. It is worth mentioning that 70% of the data is used to train the suggested models, and the remaining 30%, located at the end of the time series, is used for the testing phase and for checking the models' performance (see Fig 3).

Table 3. Air temperature input design for Baker and Crary stations.
Columns: Model, Input groups, Output, Scale, Training data records, Testing data records.

The following steps summarize the primary process of developing the models for forecasting short- and mid-term air temperature.
1. Collecting the daily air temperature data from a continental climate. In this study, the Crary station is selected.
2. Converting the temperature values from Fahrenheit to Celsius using $T_C = (T_F - 32) \times 5/9$, where $T_C$ and $T_F$ are the temperatures measured in degrees Celsius and Fahrenheit.
3. Computing the mean weekly temperatures.
4. Selecting the best lags using the ACF and PACF for both scales (weekly and daily).
5. Dividing the data so that the first 70% of the records are used for training, and the rest are used for testing. The number of testing data points was fixed to ensure a fair evaluation of the proposed models throughout the most critical step (the testing phase). Notably, this procedure does not affect the data partitions. For example, the daily scale uses 2411 testing records, which represent about 30% of the entire daily records (Table 3).
6. Normalizing the training and testing datasets based on the minimum and maximum temperature in the training dataset [56], where $T_{normalized}$ is the normalized temperature for the i-th temperature record ($T_i$), and $T_{min}$ and $T_{max}$ are the minimum and maximum temperatures obtained from the training dataset.
7. Assigning the hyperparameters of the applied models. The trial-and-error method is used for this process: each model was trained 100 times on the training dataset with different parameters, and the best candidates were selected according to several statistical metrics. The model that generates the lowest forecasting error in the training step is selected. In addition, the performance of the model should be stable, so that there is no significant difference between its performance in the training and testing phases.
8. De-normalizing the data by inverting the normalization formula.
9. Evaluating the accuracy of the models with the testing dataset.
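The preprocessing steps above (unit conversion, weekly averaging, normalization with training-set statistics, and de-normalization) can be sketched as follows. The min-max scaling to [0, 1] is an assumption based on the description in step 6; the study built its models in MATLAB, and the helper names and sample values below are purely illustrative.

```python
def fahrenheit_to_celsius(t_f):
    """Step 2: convert a temperature from Fahrenheit to Celsius."""
    return (t_f - 32.0) * 5.0 / 9.0


def weekly_means(daily):
    """Step 3: average complete, non-overlapping 7-day blocks."""
    n_weeks = len(daily) // 7
    return [sum(daily[7 * w:7 * w + 7]) / 7.0 for w in range(n_weeks)]


def normalize(values, t_min, t_max):
    """Step 6: min-max scaling with limits taken from the training set."""
    return [(t - t_min) / (t_max - t_min) for t in values]


def denormalize(values, t_min, t_max):
    """Step 8: invert the min-max scaling."""
    return [v * (t_max - t_min) + t_min for v in values]


# Ten made-up daily readings in Fahrenheit.
daily_f = [14.0, 23.0, 32.0, 41.0, 50.0, 59.0, 68.0, 77.0, 86.0, 95.0]
daily_c = [fahrenheit_to_celsius(t) for t in daily_f]

# Chronological split: the first 7 days for training, the last 3 for testing.
train, test = daily_c[:7], daily_c[7:]
t_min, t_max = min(train), max(train)  # limits come from the training set only

train_scaled = normalize(train, t_min, t_max)
test_scaled = normalize(test, t_min, t_max)  # may fall outside [0, 1]
test_restored = denormalize(test_scaled, t_min, t_max)
week_avg = weekly_means(daily_c)
```

Because the limits come from the training set only, testing values that exceed the training range are scaled to values above 1, which is expected and avoids leaking information from the testing period.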
It is important to mention that all models are constructed using MATLAB software. The candidate parameters of the applied models are listed below:
• RF: The number of trees is selected between 20 and 100, while the leaf node ranges from 1 to 5.
• SVR: The box constraint (regularization parameter) is set between 0.7 and 1, which means that sigma ranges from 0.8452 to 0.7071. The epsilon parameter ∈ [0.6, 1]. Finally, the kernel scale parameter ranges from 0.8 to 1.
• Bag fraction = 1 for the ensemble models, which means that "roughly 2/3 of input data is selected for training for every tree and the remaining 1/3 is used as out-of-bag observations".
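The "roughly 2/3" figure quoted for a bag fraction of 1 follows from sampling with replacement: a bootstrap sample of size n contains on average a fraction 1 - (1 - 1/n)^n of the distinct records, which tends to 1 - 1/e ≈ 0.632. The short check below is an illustrative sketch with hypothetical function names and is not part of the MATLAB workflow of the study.

```python
import random


def expected_unique_fraction(n):
    """Expected share of distinct records in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_unique_fraction(n, seed=0):
    """Draw one bootstrap sample of size n and measure its distinct share."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n


n = 5000
expected = expected_unique_fraction(n)    # close to 1 - 1/e
simulated = simulated_unique_fraction(n)  # one random draw
out_of_bag = 1.0 - simulated              # share left for out-of-bag checks
```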
In the second scenario of this work, a randomization method is used to divide the data into the training and testing phases. In this scenario, the effect of the climate on the performance of the AI models would be more obvious, because the models are trained using some of the recent records.
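The difference between the two data-division scenarios can be sketched as follows. The 70/30 ratio is taken from the text; the function names, the fixed seed, and the 100-record stand-in series are illustrative assumptions.

```python
import random


def chronological_split(records, train_fraction=0.7):
    """First scenario: the earliest records train, the latest records test."""
    cut = round(len(records) * train_fraction)
    return records[:cut], records[cut:]


def random_split(records, train_fraction=0.7, seed=42):
    """Second scenario: records are assigned to each phase at random."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)
    cut = round(len(records) * train_fraction)
    train_idx = sorted(indices[:cut])  # keep time order inside each phase
    test_idx = sorted(indices[cut:])
    return [records[i] for i in train_idx], [records[i] for i in test_idx]


records = list(range(100))  # stand-in for time-ordered temperature records

train_a, test_a = chronological_split(records)  # training never sees late data
train_b, test_b = random_split(records)         # training includes recent data
```

In the random split the training phase contains records from the whole period, so the models are exposed to recent temperature behavior, whereas in the chronological split the recent records are seen only during testing.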

Statistical metrics
Different statistical metrics are used to assess model accuracy in daily and weekly temperature forecasting and to identify the most efficient model, i.e., the one with the least forecasting error. Four statistical measures are adopted to examine the forecasting accuracy of the suggested modeling approaches: root mean square error (RMSE), correlation coefficient (R), Theil's U-statistic (U), and mean absolute error (MAE) [57,58].
In these measures, $A_i$ and $P_i$ are the actual and forecasted temperatures for the i-th observation, $\bar{A}$ and $\bar{P}$ are the means of the actual and predicted values, and N is the total number of observations. These statistical measures have been used frequently in the literature for model comparison, and they can be computed directly from the observed and predicted values. Based on the results of these statistical measures, the model presenting the lowest forecasting error and the highest value of R (close to one) is selected as the best model for predicting the air temperature for short- and mid-term forecasting.
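The four measures can be sketched in a few lines. RMSE, MAE, and the Pearson correlation coefficient follow their standard definitions; for Theil's U, the bounded form U = RMSE / (sqrt(mean(A²)) + sqrt(mean(P²))) is assumed here, since the exact expression of the study is not reproduced in this text. Function names and sample values are illustrative.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def correlation(actual, predicted):
    """Pearson correlation coefficient R."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def theil_u(actual, predicted):
    """Theil's U in its bounded form (assumed variant), between 0 and 1."""
    n = len(actual)
    rms_a = math.sqrt(sum(a ** 2 for a in actual) / n)
    rms_p = math.sqrt(sum(p ** 2 for p in predicted) / n)
    return rmse(actual, predicted) / (rms_a + rms_p)


# Made-up values in degrees Celsius.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.5]

scores = {
    "RMSE": rmse(actual, predicted),
    "MAE": mae(actual, predicted),
    "R": correlation(actual, predicted),
    "U": theil_u(actual, predicted),
}
```

A perfect forecast gives RMSE = MAE = U = 0 and R = 1, which is the direction in which the models are ranked.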

First scenario
This scenario investigates the capability of the AI models to predict the one-step-ahead values of daily and weekly temperature. In this part of the work, the data are divided into two phases, training and testing, using the classical method: the first 70% of the recorded temperatures are used for training, and the last 30% are used for testing. In this scenario, the effect of the current temperature trends is not considered during training; in other words, the most recent temperature records are used only in the testing phase. Thus, the models are tested and evaluated based on their ability to predict the most recent values of the time series. Furthermore, the input lags were determined by the ACF and supplied to the adopted models (RF, SVR, GBR, RT, and QRT). Different statistical parameters and comparative figures are used to assess the models' performance. This part discusses the performance of the proposed models in predicting the daily and weekly temperature over the training and testing phases for different input lags. Tables 4 and 5 show the performance of the proposed models during the training phase for both daily and weekly temperature prediction. For the daily forecast, all models provide satisfactory results. Nevertheless, the RF provides the best performance, with RMSE ranging from 1.807 to 2.824, R ranging from 0.978 to 0.991, U ranging from 0.065 to 0.101, and MAE ranging from 1.389 to 2.157, followed by the QRT model with RMSE ranging from 2.0874 to 3.1830, R ranging from 0.9722 to 0.9882, U ranging from 0.0745 to 0.1134, and MAE ranging from 1.3492 to 2.24, and the RT model with RMSE ranging from 2.3687 to 3.09, R ranging from 0.9738 to 0.9849, U ranging from 0.0849 to 0.1106, and MAE ranging from 1.7951 to 2.374.
The GBR and SVR models came last, with RMSE ranging from 3.4233 to 3.6633 and from 3.6821 to 3.7887, R ranging from 0.963 to 0.9677 and from 0.9604 to 0.9626, U ranging from 0.1229 to 0.1328 and from 0.1321 to 0.1353, and MAE ranging from 2.6659 to 2.8746 and from 2.8369 to 2.8965, respectively. Furthermore, increasing the number of input lags increases the accuracy of the QRT, RT, and SVR models. Here the QRT model reaches its optimum accuracy with nine input lags (QRT-M9), the RT model with eight input lags (RT-M8), the SVR model with ten input lags (SVR-M9), and the GBR model with five input lags (GBR-M5). In contrast, the RF model requires only one input lag to reach its optimum performance (RF-M1).
For weekly temperature prediction, all models provide satisfactory predictions. The QRT model performs best, with RMSE ranging from 2.5069 to 3.7157, R ranging from 0.9589 to 0.9817, U ranging from 0.0932 to 0.11386, and MAE ranging from 1.6300 to 2.5880, followed by the RF model with RMSE ranging from 2.799 to 3.647, R ranging from 0.960 to 0.977, U ranging from 0.105 to 0.136, and MAE ranging from 2.116 to 2.787. Moreover, increasing the input lags for weekly prediction improves the accuracy of the QRT, RF, RT, and SVR models, which allows them to reach their optimum accuracy (QRT-M16, RF-M16, RT-M16, SVR-M16). Any increase in the number of lags beyond seven has a negative effect on model performance. At the same time, the GBR model requires five lags (same as the daily prediction) to reach its optimum accuracy (GBR-M14). Overall, the QRT, RF, RT, and SVR models predict the daily temperature better than the weekly temperature during the training phase, whereas the GBR model predicts the weekly temperature better. Based on the training phase, all models perform very well in predicting the daily and weekly temperature. However, assessing the models on the testing dataset is also crucial. In the training phase, the models are provided with complete data (inputs and targets), which may result in overfitting, so ignoring the models' performance in the testing phase may give users misleading results. In the testing phase, the models receive only the input features, and thus the measured forecasting accuracy is more reliable than in the training phase [46,59].
During the testing phase, the performance of the proposed models was assessed firstly by comparing the performance with each other and secondly by comparing the performance with the ARIMA model as a benchmark model. Table 6 shows the performance of the proposed models during the testing phase for daily temperature prediction. For daily forecast, the RF model provides the best performance, with RMSE ranging from 1.776 to 3.765, R ranging from 0.960 to 0.991, U ranging from 0.063 to 0.133, and MAE ranging from 1.353 to 2.898 followed by the SVR model with RMSE ranging from 3.5915 to 3.6599, R ranging from 0.9621 to 0.9635, U ranging from 0.1265 to 0.1288, and MAE ranging from 2.7451 to 2.7902. Moreover, the RF model requires only one input lag (RF − M1) to reach the best accuracy, while the SVR model requires five input lags. On the other hand, despite the RT and QRT models showing the best performance during the training phase, they came last during the testing phase as they have a tendency to overfit during the training phase.
For further assessment, the ARIMA model was implemented for daily and weekly predictions using two different scenarios. In the first, ARIMA was used to predict temperature from the raw data set. In the second, data preprocessing was used to improve the capacity of ARIMA: the differencing method was applied to remove seasonality. This method can also remove the temporal dependence, that is, the dependence of the series on time. The best prediction results are used as a benchmark to validate the AI models. It is important to mention that the time series became smoother after the differencing transformation was applied (see Fig 4a and 4b). Considering the PACF

For weekly temperature prediction, as presented in Table 8, the RF model demonstrates excellent performance, providing high prediction accuracy with R ranging from 0.933 to 0.982 and lower prediction errors, with RMSE ranging from 2.478 to 4.665, U ranging from 0.091 to 0.173, and MAE ranging from 1.874 to 3.614, compared to the other models. Furthermore, the high performance of the RF model was achieved using only one input lag (RF-M10), while the other models require considerably more input lags (seven lags) to reach their optimum performance (SVR-M16, RT-M16, GBR-M16, QRT-M16), and increasing the lags beyond seven tends to reduce the models' performance. Overall, the proposed models performed slightly better in predicting the daily temperatures than the weekly ones.
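The differencing transformation applied before fitting ARIMA can be sketched as follows. A lag of 1 gives ordinary first differencing, and a lag equal to the seasonal period removes a repeating seasonal signal; which lag the study used is not stated in this text, so the lag, the function names, and the synthetic series are assumptions.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def undifference(first_values, diffs, lag=1):
    """Rebuild the original series from its first `lag` values and the diffs."""
    restored = list(first_values)
    for d in diffs:
        # restored[-lag] is the value observed `lag` steps before the new one.
        restored.append(restored[-lag] + d)
    return restored


# Synthetic series: a 4-step seasonal pattern plus a slow upward trend.
pattern = [-12.0, -8.0, 5.0, 20.0]
series = [pattern[t % 4] + 0.1 * t for t in range(12)]

seasonal_diff = difference(series, lag=4)  # seasonal pattern cancels out
first_diff = difference([1.0, 3.0, 5.0, 7.0])  # linear trend becomes constant
rebuilt = undifference(series[:4], seasonal_diff, lag=4)
```

After seasonal differencing only the constant trend increment remains, which is the smoothing effect described for Fig 4a and 4b; forecasts made on the differenced scale are mapped back with the inverse operation.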
On the other hand, the performance of the ARIMA model for weekly temperature prediction is presented in Table 9. Notably, the differencing method smooths the time series by removing the seasonal signal (see Fig 5a and 5b). According to Fig 5c, based on the PACF, three lags (1, 2, and 3 weeks) have been considered for the model development. As shown in Table 9, the performance of the ARIMA model is significantly lower than that of the RF model: the latter increases the prediction accuracy by 5.36% in terms of R and reduces the prediction error by 48%, 47%, and 48% in terms of RMSE, U, and MAE, respectively. Furthermore, the ARIMA model requires only one input lag (ARIMA(1,1,1)) for weekly prediction to reach its best performance.
For further assessment, we compared the performance of the best models obtained from the daily prediction (SVR-M5, RT-M5, GBR-M6, QRT-M9, and RF-M1) for each month of the year. The accuracy of daily temperature prediction may vary from month to month, so it is essential to investigate the performance of the applied models for each month. This investigation is further motivated by the considerable variation in temperature across the months of the year (see Table 2): the standard deviation (St.D) varies from 2.916 to 7.905. The other significant indicator is that the data length varies from month to month (see Fig 6). Fig 7 shows the performance of the best models based on the RMSE statistic for each month of the year. After the training process was completed, the performance of each model was assessed individually. In general, statistical metrics such as RMSE provide an overall evaluation of a model; this figure is therefore created to show the monthly performance of each model. It is observed that the models have difficulty estimating the temperatures of the winter season. According to Fig 7, the RF and SVR models provided the lowest prediction error for almost all months, followed by the GBR model. The highest forecasting errors are observed in January, February, and December. Two reasons may explain this problem. The first reason may be associated with variability of temperature in the  In terms of the total number of days, February is the shortest month. It can be observed from Fig 7 that the number of data records used in this month constitutes only 7.74% of the total data, which is the lowest percentage of data used in this study. Therefore, the models do not have enough training data to simulate that period of the year, in which the temperature changes significantly within a short period. Furthermore, the RF predicts the temperatures measured in February more efficiently and presents less variance than the other models. A further noticeable observation related to daily temperature prediction is that all the models except the QRT model require few input lags to reach their optimum performance (RF-M1, SVR-M5, RT-M5, GBR-M6), while the QRT model requires more input lags (nine lags) to achieve its optimum performance (QRT-M9).
The performance of the proposed models during the testing phase is also assessed using scatter diagrams (see Figs 8 and 9), histograms (see Figs 10 and 11), and box plots (see Figs 12 and 13). Figs 8 and 9 show the scatter plots of the observed against the predicted temperatures for daily and weekly prediction. The plots examine the relationship between the predicted and the observed temperatures and measure the degree of association between these two variables in terms of the coefficient of determination (R²). For daily prediction, the RF model yielded the best prediction performance, with R² ≈ 0.983, while the other models performed similarly to one another in terms of R². Additionally, across all data samples, the RF model deviates considerably less from the ideal line than the other models. For weekly prediction, the RF model still demonstrated a robust prediction performance, with a significantly higher R² value (R² ≈ 0.964) than the other models. At the same time, the RF model showed the least deviation from the ideal line for all data samples.
Figs 10 and 11 show the histograms of the forecasting error for both horizons (i.e., daily and weekly) during the testing phase. Each plot visualizes the error distribution by showing the number of error values within a specified range and includes the Gaussian kernel density function to check the normality of the errors. From Figs 10 and 11, it can be inferred that the RF model performs better than the other models in terms of mean error and standard deviation for daily and weekly temperature predictions and provides an error distribution similar to the normal distribution. Moreover, box plots are constructed to depict the distribution and skewness of the forecasting errors by displaying quartiles and averages. The plots display the values in a standardized manner using a five-number summary (i.e., minimum, first quartile, median, third quartile, and maximum) and present more visual information regarding the effectiveness of each model separately. The figures help to better understand the characteristics of the forecasting errors generated by the applied models. For the daily scale, all models produce similar outlier values, with slightly fewer for the RF-M1 model (see Fig 12a). The quartiles of the measured errors are provided in Fig 12b. Accordingly, the RF-M1 model generates a lower interquartile range (IQR = 3.93) than the other models, indicating its efficiency in predicting the daily temperature. For the weekly time scale, the RF-M1 model shows the best performance because its median and mean error values are closer to zero than those of the other models (Fig 13a). In addition, it generates fewer outliers than the other models. Most importantly, Fig 13b shows that the RF-M1 model generates a significantly narrower error spread (IQR = 2.554) than the other models, whose IQR ranges from 4.738 to 5.353.
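The quantities read from the box plots can be computed as in the sketch below, which uses the common 1.5 × IQR fences to flag outliers. The quartile method, the fence rule, and the sample errors are assumptions, since the plotting settings of the study are not given in this text.

```python
import statistics


def five_number_summary(errors):
    """Minimum, first quartile, median, third quartile, and maximum."""
    q1, median, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    return min(errors), q1, median, q3, max(errors)


def iqr_and_outliers(errors, whisker=1.5):
    """Interquartile range and the errors lying beyond the whisker fences."""
    _, q1, _, q3, _ = five_number_summary(errors)
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    outliers = [e for e in errors if e < lower_fence or e > upper_fence]
    return iqr, outliers


# Made-up forecasting errors (predicted minus observed, in degrees Celsius).
errors = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 12.0]

summary = five_number_summary(errors)
iqr, outliers = iqr_and_outliers(errors)
```

The IQR measures the spread of the central half of the errors, while outliers are the individual errors beyond the fences, so a model can have a narrow IQR and still produce a few large outliers.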
For further evaluation, residual error diagrams for both the daily and weekly scales have been developed (Figs 14 and 15). The diagrams act as a performance measure for the applied models and represent the difference between the forecasted and the actual temperature values. It is observed from Figs 14 and 15 that the RF model produces the smallest residual errors of all the applied models and outperforms them in terms of prediction accuracy.
Lastly, the capacity of the predictive models has been investigated for the hottest months (June, July, and August). These months have the highest temperatures; thus, it is vital to see which of the applied AI models best reproduce the extreme temperature values. For that, the data points that lie within the 95% confidence interval (mean ± standard deviation) have been computed; it can be seen from Fig 16 that only the RF-M1 model outperformed the comparable models. Moreover, the SVR-M5 model could not deal well with the high-temperature values in the hottest months for this study area.

Second scenario
The previously described findings were obtained when data from 2000 to 2015 were used to train the applied models. These data accounted for 70% of the total records, and the remaining 30% of the data points were used for testing the models. This type of data division tests how well the models can simulate the pattern of data recorded in recent years. Many parts of the world are facing a global warming crisis, and the temperature time series of the last decade have shown a behavior and pattern that differ somewhat from those observed in previous years. Accordingly, this study investigates how climate change affects the temperature records. To do that, recent temperature records were used in both the training and testing phases: the data records were randomly divided into two phases, training (70%) and testing (30%). After that, the data-driven models (QRT, GBR, RF, and RF) were trained and assessed using several statistical metrics based on this data division. Furthermore, the outcomes of these models are compared with their corresponding results, which have already been discussed. This technique may help to observe whether using recent time series data during training affects the behavior and accuracy of the model predictions. According to the results shown in Table 10, the performance of all the predictive models is positively affected by the randomization method. For example, when the classical data division procedure was used, the RF model generated relatively higher errors (RMSE = 1.776, U = 0.063, and MAE = 1.353); with the randomization method, however, the prediction accuracy is slightly enhanced and the model provides lower forecasting errors (RMSE = 1.697, U = 0.061, and MAE = 1.325). Overall, the randomization method improves the model capacity because it includes features related to future temperature trends in the training data used to train the suggested models in this work.
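The difference between the two data-division schemes, together with the error metrics quoted above, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the study's code: the function names are hypothetical, the random seed is arbitrary, and Theil's U is assumed to take the U1 form (RMSE divided by the sum of the root mean squares of the observed and predicted series), which may differ from the definition used in the study.

```python
import math
import random


def chronological_split(records, train_fraction=0.7):
    """First scenario: the earliest records train, the latest records test."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


def random_split(records, train_fraction=0.7, seed=0):
    """Second scenario: records are assigned to train/test at random."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)
    cut = int(len(records) * train_fraction)
    train_idx, test_idx = sorted(indices[:cut]), sorted(indices[cut:])
    return [records[i] for i in train_idx], [records[i] for i in test_idx]


def rmse(observed, predicted):
    """Root mean square error."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / len(observed)


def theil_u(observed, predicted):
    """Theil's U, assuming the U1 form (0 indicates a perfect forecast)."""
    n = len(observed)
    rms_o = math.sqrt(sum(o ** 2 for o in observed) / n)
    rms_p = math.sqrt(sum(p ** 2 for p in predicted) / n)
    return rmse(observed, predicted) / (rms_o + rms_p)


# Hypothetical record labels standing in for time-ordered observations
records = list(range(100))
chrono_train, chrono_test = chronological_split(records)
rand_train, rand_test = random_split(records)
print(len(chrono_train), len(chrono_test), len(rand_train), len(rand_test))
```

Under the chronological split every training record precedes every test record, whereas the random split lets recent records enter the training set, which is the property the second scenario relies on.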

Conclusion
The accuracy of the data-driven models, namely RT, SVR, QRT, RF, ARIMA, and GBR, has been investigated for forecasting atmospheric air temperature on different time scales (daily and weekly) using historical meteorological data. The data were collected from Cray station, located in North Dakota, USA. This region experiences a volatile continental climate throughout the year. The time-series data are relatively long (2000 to 2021); 70% of the data are used for model calibration (2000 to 2015), and the rest are used for testing. Several input groups were examined with different lag times. On the daily scale, the RF technique provided more accurate outcomes than the comparable models.
Moreover, the advanced analysis of the forecasting error showed that the performance of the models was significantly affected by data variability, consistency, and extreme temperature values. As January, February, and December had higher variability in temperature values, the effect on model performance was greater for these months. In addition, the forecasting errors observed for these months were higher than for other months because their average temperatures fell below the overall average temperature of the entire dataset.
The models performed very well for the weekly time scale, but the RF modeling technique provided more accurate results than the other models. In general, the accuracy of daily temperature forecasting was higher than that of the weekly scale. This may be because the weekly records were calculated by averaging the temperatures over seven days, which led to the loss of some critical data characteristics. Furthermore, as the weekly scale was derived from the daily records, the length of the time-series data was reduced significantly, which affected the efficiency of the models during the training process.
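The effect of deriving the weekly series from the daily records can be illustrated with a short sketch. It assumes non-overlapping seven-day blocks and drops any incomplete trailing week; both choices, as well as the function name `weekly_means`, are assumptions made for illustration and are not taken from the study.

```python
def weekly_means(daily, days_per_week=7):
    """Average consecutive, non-overlapping blocks of daily values.

    Assumption: an incomplete trailing week is discarded.
    """
    n_weeks = len(daily) // days_per_week
    return [
        sum(daily[w * days_per_week:(w + 1) * days_per_week]) / days_per_week
        for w in range(n_weeks)
    ]


# Hypothetical daily temperatures (degrees Celsius) for two weeks
daily = [-8.0, -3.0, 2.0, 6.0, 1.0, -4.0, -1.0,
         0.0, 5.0, 9.0, 12.0, 7.0, 3.0, -1.0]
weekly = weekly_means(daily)
print(len(daily), len(weekly))
print(max(daily) - min(daily), max(weekly) - min(weekly))
```

The weekly series is about one seventh as long as the daily series, and the averaging smooths out day-to-day extremes, which matches the two explanations offered above for the lower weekly accuracy.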
This study also investigated the AI models' capacity to predict temperature when data reflecting the future pattern are included in training. In this scenario, randomized data division was applied to divide the data into training and testing sets. The study found that the prediction models' performance was enhanced after using this technique. This means that the current pattern of the temperature data, which is affected by climate change, influences the quality of predictions. Besides, the case study location is gradually becoming affected by climate change and its impact on temperature values.
Thus, this study suggests the following recommendations:
• Adopting a more robust approach to determine the best input combination instead of the existing ACF and PACF methods.
• Applying a bio-inspired algorithm to select the optimal hyperparameters of SVR.
• Studying to what extent the size of the data used to train the applied models affects the accuracy of predictions. This task can be accomplished using different training and testing ratios.